Final Project Submission


Introduction

This linear regression analysis of the King County housing dataset will serve to inform real estate construction companies in Washington looking to invest in building a new home or residential community. The goals are to discover which features of a home increase its sale value and to quantitatively define the relationship between those features and the price of a home.

This dataset contains over 15k data points representing house sales in King County between May 2014 and May 2015. Each data point includes information on amenities, number of bedrooms and bathrooms, square footage, renovations, location, views, and sale price.

PREPARING THE DATA

I'll import the dataset and convert the ordinal string variables to integers to allow for early correlation screening.
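
The ordinal conversion described above can be sketched as follows. This is a minimal illustration, not the notebook's original code: the toy rows, the view labels, and the name view_order are assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the real file; labels and prices are made up.
df = pd.DataFrame({
    "view": ["NONE", "FAIR", "AVERAGE", "GOOD", "EXCELLENT", "NONE"],
    "price": [350000, 420000, 410000, 515000, 780000, 298000],
})

# Hypothetical ordered mapping from string label to integer rank.
view_order = {"NONE": 0, "FAIR": 1, "AVERAGE": 2, "GOOD": 3, "EXCELLENT": 4}
df["view_code"] = df["view"].map(view_order)

# With integer codes in place, the column can take part in correlation screening.
view_price_corr = df["view_code"].corr(df["price"])
```

Any label missing from the mapping becomes NaN after map, which makes unexpected categories easy to spot before modeling.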

As price is heavily skewed, I will remove outliers.
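
The source does not say which outlier rule was used. As one common option, the sketch below applies the 1.5 x IQR rule to a made-up price series; the values and the choice of rule are assumptions.

```python
import pandas as pd

# Made-up prices with one extreme value in the right tail.
prices = pd.Series([250000, 310000, 355000, 400000,
                    450000, 520000, 610000, 7700000])

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows that fall inside the fences.
trimmed = prices[(prices >= lower) & (prices <= upper)]
```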

That's better.

IDENTIFYING INTERACTIONS

There appears to be a strong relationship between sqft_above and sqft_living. This makes sense, of course, as the two measurements overlap. To avoid issues of collinearity, it makes sense to include only sqft_living, as it has the strongest correlation with price. It also seems that all of the predictors except grade have a skewed distribution similar to that of price. The King County Assessor website defines grade as corresponding to the quality of construction and materials, and thus the cost of construction.
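
A rough sketch of this kind of collinearity screening is shown below. The data are synthetic and the 0.8 cutoff is an assumed threshold, not one taken from the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins: living area is built from above-ground plus basement area.
sqft_above = rng.normal(1800, 400, n)
sqft_basement = np.clip(rng.normal(300, 150, n), 0, None)
sqft_living = sqft_above + sqft_basement
price = 150 * sqft_living + rng.normal(0, 50000, n)

df = pd.DataFrame({
    "sqft_above": sqft_above,
    "sqft_basement": sqft_basement,
    "sqft_living": sqft_living,
    "price": price,
})
corr = df.corr()

# Flag predictor pairs whose absolute correlation exceeds the assumed cutoff.
predictors = ["sqft_above", "sqft_basement", "sqft_living"]
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr.loc[a, b]) > 0.8
]

# From each flagged pair, keep the member more strongly correlated with price.
keep = [max(pair[:2], key=lambda c: abs(corr.loc[c, "price"])) for pair in high_pairs]
```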

BASELINE MODEL

Interpretation

The baseline model performed decently well, explaining about 65% of the variance in sale price, and the model is statistically significant.

All of the predictors were statistically significant:
For each additional bathroom, price changes by -7,263.03
For each additional unit of sqft_living, price changes by +89.07
For each additional unit of sqft_living15, price changes by +33.78
For each additional degree of lat, price changes by +598,492.93
For each one-level increase in grade, price changes by +59,058.73
Houses on a waterfront have a 131,255.15 higher price tag compared to those not on a waterfront
Houses with a "fair" view have a 95,636.39 higher price tag than those with "no view"
Houses with an "average" view have a 90,037.47 higher price tag than those with "no view"
Houses with a "good" view have a 105,346.57 higher price tag than those with "no view"
Houses with an "excellent" view have a 173,596.66 higher price tag than those with "no view"

It is odd that an increase in bathrooms supposedly decreases the price, so I will investigate that.
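
One possible reason for a negative bathrooms coefficient is that bathrooms is correlated with other predictors such as sqft_living. The source does not confirm this. The synthetic sketch below only shows that a predictor can correlate positively with price on its own and still receive a negative coefficient once a correlated predictor is in the model. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Invented data: bathrooms rise with living area, price depends on both.
sqft_living = rng.normal(2000, 500, n)
bathrooms = sqft_living / 1000 + rng.normal(0, 0.3, n)
price = 100 * sqft_living - 5000 * bathrooms + rng.normal(0, 10000, n)

# Taken alone, bathrooms and price move together.
marginal_corr = np.corrcoef(bathrooms, price)[0, 1]

# Ordinary least squares with an intercept and both predictors.
X = np.column_stack([np.ones(n), sqft_living, bathrooms])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)
bathroom_coef = coefs[2]  # effect of bathrooms holding sqft_living fixed
```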

The assumptions of linearity, independence, normality, and homoscedasticity (equal variance of the residuals) appear to hold in this model. However, the heteroscedasticity plot and Q-Q plot look slightly irregular, and as our target variable price had a slightly irregular distribution, I will log transform it to see if this improves the model.
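
As a small illustration of why a log transform can help a right-skewed target, the sketch below compares skewness before and after the transform on synthetic, lognormally distributed prices. The data are an assumption and are not drawn from the King County file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic right-skewed prices (lognormal), standing in for the real target.
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=2000), name="price")

# Natural log of the target; this is what the model would then be fit on.
log_price = np.log(price)

skew_before = price.skew()
skew_after = log_price.skew()
```

After fitting on log(price), coefficients describe approximate proportional changes in price rather than absolute ones.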

Interpretation

The model remains significant with an increase in R2 from 0.648 to 0.668 (+0.02). The Q-Q plot also looks closer to a normal distribution, and the rainbow test p-value increased. The heteroscedasticity plot looks slightly more homoscedastic, but the Goldfeld-Quandt test says otherwise. The log transformation has also made the bathrooms coefficient positive. I do not want to log transform the predictors, so I will look at adding another predictor from the original data.

While looking for interaction terms, I stumbled upon an interaction between sqft_living and floors, where there seem to be ~3 separate populations within floors. Because floors is a categorical variable, I will add it to the independent variables as dummies.
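
Dummy encoding of floors might look like the sketch below. The rows are made up, and dropping the first level (one floor) as the reference category is an assumption, though it matches the later comparisons against houses with one floor.

```python
import pandas as pd

# Made-up floor counts covering four levels.
df = pd.DataFrame({"floors": [1.0, 1.5, 2.0, 2.5, 1.0, 2.0]})

# One indicator column per level; the first level (1.0) is dropped so it
# serves as the reference category.
floor_dummies = pd.get_dummies(
    df["floors"], prefix="floors", drop_first=True, dtype=int
)
```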

Interpretation

This seems to have slightly improved the model. The adj. R2 increased modestly from 0.668 to 0.677 (+0.009), and heteroscedasticity was reduced. floors_1.5 and floors_2.5 seem to be statistically significant, while the other floors dummies are not.

Now I will see what happens when I create the interaction terms between sqft_living and floors.
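
Interaction terms of this kind are typically the product of the continuous predictor and each dummy column. The sketch below assumes that construction; the toy rows and the sqft_living_x_ prefix are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "sqft_living": [1200, 1800, 2400, 3000],
    "floors": [1.0, 1.5, 2.0, 1.5],
})

floor_dummies = pd.get_dummies(
    df["floors"], prefix="floors", drop_first=True, dtype=int
)

# Multiply each dummy column by sqft_living row-wise, so the slope of
# sqft_living is allowed to differ by floor category.
interactions = floor_dummies.mul(df["sqft_living"], axis=0).add_prefix(
    "sqft_living_x_"
)

# Design matrix: main effects plus interaction columns.
design = pd.concat([df[["sqft_living"]], floor_dummies, interactions], axis=1)
```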

Interpretation

This did not seem to improve the model: the adj. R2 only increased by 0.001, heteroscedasticity increased, and the model has become more complex to interpret. Finally, I will try applying the natural log to sqft_living and sqft_living15, as they seemed to have non-normal distributions and may have a non-linear relationship with price.

CONCLUSION

I don't see many ways to improve the model much further, so I will select this fifth model as the one that best describes the data. The analysis follows:

First, I will remodel the same data with the independent variables normalized in order to compare their effects on the model.
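
The source does not specify the normalization method. A z-score standardization, which is one common choice, can be sketched as follows on made-up values.

```python
import pandas as pd

# Made-up predictor values on very different scales.
X = pd.DataFrame({
    "sqft_living": [1200.0, 1800.0, 2400.0, 3000.0],
    "grade": [6.0, 7.0, 8.0, 10.0],
})

# Z-score: subtract each column's mean and divide by its standard deviation,
# so coefficients become comparable across predictors.
X_scaled = (X - X.mean()) / X.std()
```

With every predictor on the same scale, the magnitude of each coefficient reflects its relative influence per standard deviation rather than per raw unit.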

Interpretation

The model explains ~68% of the variance in sale price and is statistically significant.

From the normalized results, we can see that sqft_living, grade, and lat have the largest effect in our model. Of those, sqft_living and grade are the most actionable for the purposes of a construction company.

For each additional bathroom, we see an associated increase of ~1.31% in price.
For each additional grade, we see an associated increase of ~13.35% in price.
For each additional degree of latitude in King County, we see an associated increase of ~321.30% in price.
For each increase of 1% sqft_living, we see an associated increase of ~0.32% in price.
For each increase of 1% sqft_living15, we see an associated increase of ~0.17% in price.
Houses on a waterfront have a ~42% higher price compared to those not on a waterfront.
Houses with a "fair" view have a ~20.5% higher price tag than those with "no view".
Houses with an "average" view have a ~18.6% higher price tag than those with "no view".
Houses with a "good" view have a ~22.4% higher price tag than those with "no view".
Houses with an "excellent" view have a ~31.4% higher price tag than those with "no view".
Houses with 1.5 floors have a ~14.8% higher price tag than those with one floor.
Houses with 2.5 floors have a ~11.1% higher price tag than those with one floor.
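
Percentages like those above usually come from converting coefficients of a model fit on log(price). The sketch below shows the standard conversions. The two coefficient values are hypothetical and chosen only for illustration; the function names are not from the source.

```python
import math


def pct_change_per_unit(coef):
    """Percent change in price for a one-unit increase in an untransformed
    predictor, when the target is log(price)."""
    return (math.exp(coef) - 1) * 100


def pct_change_per_pct(coef):
    """Percent change in price for a 1% increase in a log-transformed
    predictor, when the target is log(price)."""
    return (1.01 ** coef - 1) * 100


# Hypothetical coefficients, not the model's actual estimates.
waterfront_effect = pct_change_per_unit(0.35)  # roughly 41.9
sqft_effect = pct_change_per_pct(0.32)         # roughly 0.32
```

For small coefficients the two conversions are close to 100 * coef and coef respectively, which is why log-log coefficients are often read directly as elasticities.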

Discussion and Caveats

Though we have a decent model, several of the predictors are not particularly useful. Construction companies can rarely control whether their houses will be waterfront properties, have excellent views, or be farther from the equator. I considered removing latitude as it did not seem logical to me, but I left it in as I don't know enough about the geography of King County, Washington to say definitively that the northernmost areas are considered more desirable. For instance, maybe the northernmost houses are closer to big cities or nearer to better school districts. Grade, while a useful variable describing the quality of construction and materials, is not specific enough to generate any real insight other than "you get what you pay for".